40 results
20 - AI, Human–Robot Interaction, and Natural Language Processing
- from Part V - Advances in Multimodal and Technological Context-Based Research
- Edited by Jesús Romero-Trillo, Universidad Autónoma de Madrid
-
- Book:
- The Cambridge Handbook of Language in Context
- Published online:
- 30 November 2023
- Print publication:
- 14 December 2023, pp 436-454
-
- Chapter
-
Summary
An AI-driven (or AI-assisted) speech or dialogue system, from an engineering perspective, can be decomposed into a pipeline comprising a subset of three distinct processing activities: (1) speech processing, which turns sampled acoustic sound waves into enriched phonetic information through automatic speech recognition (ASR), and vice versa via text-to-speech (TTS); (2) natural language processing (NLP), which operates at both syntactic and semantic levels to extract the meanings of words and of the enriched phonetic information; (3) dialogue processing, which ties the two together so that the system can function within the specified latency and semantic constraints. This perspective allows for at least three levels of context. The lowest level is phonetic, where the fundamental components of speech are built from a time-sequence string of acoustic symbols (analyzed in ASR or generated in TTS). The next higher level of context is word- or character-level, normally formulated as sequence-to-sequence modeling. The highest level of context typically used today keeps track of a conversation or topic. An even higher level of context, generally missing today but essential in future systems, is that of our beliefs, desires, and intentions.
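As a rough, hedged sketch of this three-stage decomposition, the Python below wires stub ASR, NLP, dialogue-management and TTS functions into a pipeline; every function name and data structure here is a hypothetical placeholder rather than any real system's API.

```python
def recognise_speech(audio_frames):
    """(1) Speech processing: map sampled acoustic frames to a word hypothesis."""
    return "turn on the lights"  # stub ASR result; no real audio is processed

def parse_utterance(text):
    """(2) NLP: extract an intent and its arguments (syntactic + semantic levels)."""
    return {"intent": "switch", "device": "lights", "state": "on"}

def plan_response(meaning, dialogue_state):
    """(3) Dialogue processing: choose a reply given meaning plus conversation context."""
    dialogue_state.append(meaning)  # conversation/topic-level context
    return "Okay, the lights are on."

def synthesise_speech(text):
    """TTS closes the loop: turn the reply text back into (stub) audio."""
    return f"<waveform for: {text}>"

dialogue_state = []  # accumulates the conversation-level context
words = recognise_speech(audio_frames=None)
reply = plan_response(parse_utterance(words), dialogue_state)
print(synthesise_speech(reply))
```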
Peroxisome proliferator-activated receptor gamma co-activator-1 alpha in depression and the response to electroconvulsive therapy
- Karen M. Ryan, Ian Patterson, Declan M. McLoughlin
-
- Journal:
- Psychological Medicine / Volume 49 / Issue 11 / August 2019
- Published online by Cambridge University Press:
- 07 September 2018, pp. 1859-1868
-
- Article
-
Background
The transcriptional coactivator peroxisome proliferator-activated receptor-γ coactivator-1α (PGC-1α), termed the ‘master regulator of mitochondrial biogenesis’, has been implicated in stress and in resilience to stress-induced depressive-like behaviours in animal models. However, no study to date has examined PGC-1α levels in patients with depression or in response to antidepressant treatment. Our aim was to assess PGC-1α mRNA levels in blood from healthy controls and from patients with depression pre-/post-electroconvulsive therapy (ECT), and to examine the relationship between blood PGC-1α mRNA levels and clinical symptoms and outcomes with ECT.
Methods
Whole blood PGC-1α mRNA levels were analysed in samples from 67 patients with a major depressive episode and 70 healthy controls, and in patient samples following a course of ECT, using quantitative real-time polymerase chain reaction (qRT-PCR). Exploratory subgroup correlational analyses were carried out to determine the relationship between PGC-1α and mood scores.
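Purely as an illustration of the kind of group comparison and exploratory correlation described (on synthetic data; the study's actual statistical procedures are not specified in this abstract), a sketch might look like this:

```python
# Illustrative only: synthetic values standing in for normalised PGC-1α mRNA
# levels; nothing here reproduces the paper's data or its exact tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
controls = rng.lognormal(mean=0.0, sigma=0.4, size=70)   # 70 healthy controls
patients = rng.lognormal(mean=-0.2, sigma=0.4, size=67)  # 67 depressed patients
mood_scores = rng.integers(20, 45, size=67)              # hypothetical symptom scores

# Group comparison of expression levels (non-parametric, as such data are often skewed)
u, p = stats.mannwhitneyu(patients, controls, alternative="two-sided")
print(f"patients vs controls: U={u:.0f}, p={p:.3f}")

# Exploratory correlation between expression and symptom severity
rho, p_rho = stats.spearmanr(patients, mood_scores)
print(f"PGC-1α vs mood score: rho={rho:.2f}, p={p_rho:.3f}")
```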
Results
PGC-1α levels were lower in patients with depression compared with healthy controls (p = 0.03). This lower level was predominantly accounted for by patients with psychotic unipolar depression (p = 0.004). ECT did not alter PGC-1α levels in the depressed group as a whole, though exploratory analyses revealed a significant increase in PGC-1α in patients with psychotic unipolar depression post-ECT (p = 0.045). We found no relationship between PGC-1α mRNA levels and depression severity or the clinical response to ECT.
Conclusions
PGC-1α may represent a novel therapeutic target for the treatment of depression, and be a common link between various pathophysiological processes implicated in depression.
Contents
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp v-viii
-
- Chapter
7 - Audio analysis
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 195-222
-
- Chapter
-
Summary
Analysis techniques are those used to examine, understand and interpret the content of recorded sound signals. Sometimes these lead to visualisation methods, whilst at other times they may be used in specifying some form of further processing or measurement of the audio. In this chapter we shall primarily discuss general audio analysis (rather than speech analysis which uses knowledge of the semantics, production mechanism and hearing mechanism implicit to speech).
There is a general set of analysis techniques which are common to all audio signals, and indeed to many forms of data, particularly the traditional methods used for signal processing. We have already met and used the basic technique of decomposing sound into multiple sinusoidal components with the fast Fourier transform (FFT), and have considered forming a polynomial equation to replicate audio waveform characteristics through linear prediction (LPC), but there are many other useful techniques we have not yet considered.
Most analysis techniques operate on analysis windows, or frames, of input audio. Most also require that the analysis window is a representative stationary selection of the signal (stationary in that the signal statistics and frequency distribution do not change appreciably during the time duration of the window – otherwise results may be inaccurate). We discussed the stationarity issue in Section 2.5.1, and should note that the choice of analysis window size, as well as the choice of analysis methods used, depends strongly upon the nature of the signal being analysed. Speech, noise and music all have different characteristics, and, while many of the same methods can be used in their analysis, knowledge of their characteristics leads to different analysis periods and different parameter ranges of the analysis result.
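As a concrete sketch of this frame-based analysis (in Python/NumPy here, though the book's own examples use MATLAB; the 440 Hz test tone, frame length and overlap are illustrative choices):

```python
# Split a signal into short overlapping frames, taper each with a Hamming
# window, and take the FFT magnitude of each frame.
import numpy as np

fs = 16000                                   # sample rate, Hz
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)              # 1 s test tone as the input signal

frame_len = 512                              # ~32 ms at 16 kHz (quasi-stationary)
hop = 256                                    # 50% overlap
win = np.hamming(frame_len)

spectra = []
for start in range(0, len(x) - frame_len, hop):
    frame = x[start:start + frame_len] * win      # windowed analysis frame
    spectra.append(np.abs(np.fft.rfft(frame)))    # magnitude spectrum

peak_bin = int(np.argmax(spectra[0]))
print(f"peak near {peak_bin * fs / frame_len:.0f} Hz")  # close to 440 Hz
```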
Undoubtedly, those needing to perform an analysis will require some experimentation to determine the best methods to be used, the correct parameters to be interpreted and optimal analysis timings.
We will now introduce several other methods of analysing sound that form part of the audio engineer's standard toolkit, and which can be applied in many situations. We will also touch upon the analysis of some other more specialised signals such as music and animal noises before we discuss the use of tracking sound statistics as a method of analysis.
2 - Basic audio processing
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 9-53
-
- Chapter
-
Summary
Most speech and audio researchers use MATLAB as a preferred tool for audio processing, although many of us will make use of other specialised tools from time to time, such as sox for command-line audio processing (particularly when there are a large number of files to convert or process, something it can do with a single command-line option), and the sound capture and editing tool Audacity, which can record, edit, manipulate, convert and play back numerous types of audio file. In fact both of these programs are extremely capable open-source tools, having far more options than could be described here. However, while very useful, neither tool can replace the ability of MATLAB to easily develop scripts that make use of hundreds of built-in functions and operators, and to plot or visualise speech and other sounds in a multitude of ways.
Recorded speech or other sounds are stored within MATLAB (as well as in many other computer-based tools) as a vector of samples, with each individual value being a double precision floating point number. A sampled sound can be completely specified by the vector of these numbers as long as one other item of information is known: the sample rate at which the data was recorded. To replay the sampled sound, it is only necessary to sequentially output a voltage proportional to the stored vector information, with a gap between samples equivalent to the inverse of the sample rate.
General audio programs and tools store audio information similarly, except that they tend to use fixed point numbers rather than floating point, which can reduce the storage requirement by a factor of four at the expense of very little degradation – assuming the system is correctly designed. In particular, a consideration of overflow and underflow effects is usually needed when designing a system that uses fixed point storage for audio, whereas in floating point-based tools such as MATLAB this is rarely a concern in practice.
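A minimal Python/NumPy sketch of this float-versus-fixed-point trade-off (the book itself works in MATLAB; the tone and scale factor below are illustrative): a float64 vector occupies four times the space of the same samples stored as 16-bit integers, and clipping before conversion guards against the overflow mentioned above.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
x = 1.2 * np.sin(2 * np.pi * 220 * t)        # float64 vector; peaks exceed ±1.0

x_clipped = np.clip(x, -1.0, 1.0)            # avoid integer wrap-around (overflow)
x_int16 = (x_clipped * 32767).astype(np.int16)

print(x.nbytes, "bytes as float64 ->", x_int16.nbytes, "bytes as int16")
```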
Any operation that MATLAB can perform on a general vector can, in theory, be performed on stored audio. In fact, this is how we typically perform audio processing within MATLAB, and the audio vector can be loaded and saved in much the same way as any other MATLAB variable. Likewise it can be processed, added, plotted, inverted, transformed and so on.
Preface
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp ix-xi
-
- Chapter
-
Summary
Humans are social creatures by nature – we are made to interact with family, neighbours and friends. Modern advances in social media notwithstanding, that interaction is best accomplished in person, using the senses of sound, sight and touch.
Despite the fact that many people would name sight as their primary sense, and the fact that it is undoubtedly important for human communications, it is our sense of hearing that we rely upon most for social interaction. Most of us need to talk to people face-to-face to really communicate, and most of us find it to be a much more efficient communications mechanism than writing, as well as being more personal. Readers who prefer email to telephone (as does the author) might also realise that their preference stems in part from being better able to regulate or control the flow of information. In fact this is a tacit acknowledgement that verbal communications allow a higher rate of information flow, so much so that they (we) prefer to restrict or at least manage that flow.
Human speech and hearing are also very well matched: the frequency and amplitude range of normal human speech lies well within the capabilities of our hearing system. While the hearing system has other uses apart from just listening to speech, the output of the human sound production system is very much designed to be heard by other humans. It is therefore a more specialised subsystem than is hearing. However, despite the frequency and amplitude range of speech being much smaller than our hearing system is capable of, and the precision of the speech system being lower, the symbolic nature of language and communications layers a tremendous amount of complexity on top of that limited and imperfect auditory output. To describe this another way, the human sound production mechanism is quite complex, but the speech communications system is massively more so. The difference is that the sound production mechanism is mainly handled as a motor (movement) task by the brain, whereas speech is handled at a higher conceptual level, which ties closely with our thoughts. Perhaps that also goes some way towards explaining why thoughts can sometimes be ‘heard’ as a voice or voices inside our heads?
Frontmatter
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp i-iv
-
- Chapter
10 - Advanced topics
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 314-365
-
- Chapter
-
Summary
The preceding chapters have, as far as possible, attempted to isolate topics into well-defined areas such as speech recognition, speech processing, the human hearing system, the voice production system, big data and so on. This breakdown has allowed us to discuss the relevant factors, background research and application methodology in some depth, as well as develop many MATLAB examples that are mostly self-contained demonstrations of the sub-topics themselves. However, some modern speech and audio related products and techniques span across disciplines, while others cannot fit neatly into the subdivisions discussed earlier.
In this chapter we will progress onward from the foundation of previous chapters, discussing and describing various advanced topics that combine many of the processing elements that we met earlier, including aspects of both speech and hearing, as well as progressing beyond hearing into the very new research domain of low-frequency ultrasound.
It is hoped that this chapter, while conveying some fascinating (and a few unusual) application examples, will inspire readers to apply the knowledge that they have gained so far in many more new and exciting ways.
Speech synthesis
Speech synthesis means creating artificial speech, which could be by mechanical, electrical or other means (although our favoured approach is using MATLAB, of course). There is a long history of engineers who have attempted to synthesise speech, including the famous Austrian Wolfgang von Kempelen, who published a mechanical speech synthesiser in 1791 (although it should be noted that he also invented something called ‘The Turk’, a mechanical chess-playing machine which apparently astounded public and scientists alike for many years before it was revealed that a person, curled up inside, operated the mechanism). The much more sober Charles Wheatstone, one of the fathers of electrical engineering as well as being a prolific inventor, built a synthesiser based on the work of von Kempelen in 1857, proving that the original device at least was not a hoax.
These early machines used mechanical arrangements of tubes and levers to recreate a model of the human vocal tract, with air generally being pumped through using bellows and a pitch source provided by a reed or similar (i.e. just like that used in a clarinet or oboe).
9 - Speech recognition
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 267-313
-
- Chapter
-
Summary
Having considered big data in the previous chapter, we now turn our attention to speech recognition – probably the one area of speech research that has gained the most from machine learning techniques. In fact, as discussed in the introduction to Chapter 8, it was only through the application of well-trained machine learning methods that automatic speech recognition (ASR) technology was able to advance beyond a decades-long plateau that limited performance, and hence the spread of further applications.
What is speech recognition?
Entire texts have been written on the subject of speech recognition, and this topic alone probably accounts for more than half of the recent research literature and computational development effort in the fields of speech and audio processing. There are good reasons for this interest, primarily driven by the wish to be able to communicate more naturally with a computer (i.e. without the use of a keyboard and mouse). This is a wish which has been around for almost as long as electronic computers have been with us. From a historical perspective we might identify a hierarchy of mainstream human–computer interaction steps as follows:
Hardwired: The computer designer (i.e. engineer) ‘reprograms’ a computer, and provides input by reconnecting wires and circuits.
Card: Punched cards are used as input, printed tape as output.
Paper: Teletype input is used directly, and printed paper as output.
Alphanumeric: Electronic keyboards and monitors (visual display units), alphanumeric data.
Graphical: Mice and graphical displays enable the rise of graphical user interfaces (GUIs).
WIMP: Standardised methods of windows, icons, mouse and pointer (WIMP) interaction become predominant.
Touch: Touch-sensitive displays, particularly on smaller devices.
Speech commands: Nascent speech commands (such as voice dialling, voice commands, speech alerts), plus workable dictation capabilities and the ability to read back selected text.
Natural language: We speak to the computer in a similar way to a person, it responds similarly.
Anticipatory: The computer understands when we speak to it just like a close friend, husband or wife would, often anticipating what we will say, understanding the implied context as well as shared references or memories of past events.
Book features
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp xii-xiv
-
- Chapter
4 - The human auditory system
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 85-108
-
- Chapter
-
Summary
A study of human hearing and the biomechanical processes involved in hearing reveals several non-linear steps, or stages, in the perception of sound. Each of these stages contributes to the eventual unequal distribution of subjective features against purely physical ones in human hearing.
Put simply, what we think we hear is quite significantly different from the physical sounds that may be present in reality (which in turn differs from what might be recorded onto a computer, given the imperfections of microphones and recording technology). By taking into account the various non-linearities in the hearing process, and some of the basic physical characteristics of the ear, nervous system, and brain, it becomes possible to begin to account for these discrepancies between perception and physical measurements.
Over the years, science and technology have incrementally improved our ability to understand and model the hearing process using purely physical data. One simple example is that of A-law compression (or the similar µ-law used in some regions of the world), where approximately logarithmic amplitude quantisation replaces the linear quantisation of PCM (pulse code modulation): humans tend to perceive amplitude logarithmically rather than linearly, so A-law quantisation using 8 bits to represent each sample sounds better than linear PCM quantisation using 8 bits (in truth, it can sound better than speech quantised linearly with 12 bits). It thus achieves a higher degree of subjective speech quality than PCM for a given bitrate [4].
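As a hedged sketch of this companding idea (in Python/NumPy rather than the book's MATLAB), the code below applies the standard A-law curve with A = 87.6 and rounds to an 8-bit range; the bit-packing and segmented approximation of a real A-law codec are omitted.

```python
import numpy as np

A = 87.6  # standard A-law compression parameter

def a_law_compress(x):
    """Map samples in [-1, 1] through the A-law companding curve."""
    ax = np.abs(x)
    # np.maximum avoids a log(0) warning on the branch that is discarded
    y = np.where(ax < 1 / A,
                 A * ax / (1 + np.log(A)),
                 (1 + np.log(np.maximum(A * ax, 1.0))) / (1 + np.log(A)))
    return np.sign(x) * y

x = np.linspace(-1.0, 1.0, 9)          # test amplitudes
q = np.round(a_law_compress(x) * 127)  # quantise to the 8-bit range
print(q)  # small amplitudes receive proportionally more of the code range
```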
Physical processes
A cut-away diagram of the human ear (outer, middle and inner) is shown in Figure 4.1. The outer ear includes the pinna, which filters sound and focuses it into the external auditory canal. Sound then acts upon the eardrum, where it is transmitted and amplified through the middle ear by the three bones, the malleus, incus and stapes, to the oval window, opening on to the cochlea in the inner ear.
The cochlea, a coiled tube, contains an approximately 35 mm long semi-rigid pair of membranes (basilar and Reissner's) enclosed in a fluid called endolymph [35]. The basilar membrane carries the organs of Corti, each of which contains a number of hair cells arranged in two rows (approximately 3500 inner and 20 000 outer hair cells).
Acknowledgements
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp xv-xvi
-
- Chapter
Index
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 379-386
-
- Chapter
6 - Speech communications
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 140-194
-
- Chapter
-
Summary
Chapters 1 to 4 covered the foundations of speech signal processing including the characteristics of audio signals, methods of handling and processing them, the human speech production mechanism and the human auditory system. Chapter 5 then looked in more detail at psychoacoustics – the difference between what a human perceives and what is actually physically present. This chapter will now build upon these foundations as we embark on an exploration of the handling of speech in more depth, in particular in the coding of speech for communications purposes.
The chapter will consider typical speech processing in terms of speech coding and compression (rather than in terms of speech classification and recognition, which we will describe separately in later chapters). We will first consider the important topic of quantisation, which assumes speech to be a general audio waveform (i.e. the technique does not incorporate any specialist knowledge of the characteristics of speech).
Knowledge of speech features and characteristics allows parameterisation of the speech signal, in particular the important source filter model. Perhaps the pinnacle of achievement in these approaches is the CELP (codebook excited linear prediction) speech compression technique, which will be discussed in the final section.
Quantisation
As mentioned at the beginning of Chapter 1, audio samples need to be quantised in some way during the conversion from analogue quantities to their representations on computer. In effect, the quantisation process acts to reduce the amount of information stored: the fewer bits used to quantise the signal, the less audio information is preserved.
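A minimal Python/NumPy illustration of this point (illustrative signal and bit depths; the book's own examples use MATLAB): uniformly quantising the same waveform with fewer bits leaves more quantisation noise, costing roughly 6 dB of signal-to-noise ratio per bit for a full-scale sine.

```python
import numpy as np

t = np.arange(8000) / 8000
x = np.sin(2 * np.pi * 200 * t)              # test signal in [-1, 1]

for bits in (16, 8, 4):
    levels = 2 ** (bits - 1)
    xq = np.round(x * (levels - 1)) / (levels - 1)   # uniform quantisation
    noise = x - xq                                   # information lost
    snr = 10 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))
    print(f"{bits:2d} bits: SNR = {snr:5.1f} dB")
```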
Most real-world systems are bandwidth (rate) or size constrained, such as an MP3 player only being able to store 4 or 8 Gbyte of audio, or a Bluetooth-connected speaker only being able to replay sound at 44.1 kHz in 16 bits because this is the maximum-bandwidth audio signal that Bluetooth wireless can convey.
Manufacturers of MP3 devices may quote how many songs their devices can store, or how many hours of audio they can contain – these are both considered more customer-friendly than specifying memory capacity in Gbytes – however, it is the memory capacity in Gbytes that tends to influence the cost of the device. It is therefore also evident that a method of reducing the size of audio recordings is important, since it allows more songs to be stored on a device with smaller memory capacity.
8 - Big data
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 223-266
-
- Chapter
-
Summary
The emphasis of this book up to now has been on understanding speech, audio and hearing, and using this knowledge to discern rules for handling and processing this type of content. There are many good reasons to take such an approach, not least being that better understanding can lead to better rules and thus better processing. If an engineer is building a speech-based system, it is highly likely that the effectiveness of that system relates to the knowledge of the engineer. Conversely, a lack of understanding on the part of that engineer might lead to eventual problems with the speech system. However, this type of argument holds true only up to a point: it is no longer true if the subtle details of the content (data) become too complex for a human to understand, or when the amount of data that needs to be examined is more extensive than a human can comprehend. To put it another way, given more and more data, of greater and greater complexity, eventually the characteristics of the data exceed the capabilities of human understanding.
It is often said that we live in a data-rich world. This has been driven in part by the enormous decrease in data storage costs over the past few decades (from something like €100,000 per gigabyte in 1980, €10,000 in 1990, €10 in 2000 to €0.1 in 2010), and in part by the rapid proliferation of sensors, sensing devices and networks. Today, every smartphone, almost every computer, most new cars, televisions, medical devices, alarm systems and countless other devices include multiple sensors of different types backed up by the communications technology necessary to disseminate the sensed information.
Sensing data over a wide area can reveal much about the world in general, such as climate change, pollution, human social behaviour and so on. Over a smaller scale it can reveal much about the individual – witness targeted website advertisements, sales notifications that are driven from analysis of shopping patterns, credit ratings driven by past financial behaviour or job opportunities lost through inadvertent online presence. Data relating to the world as a whole, as well as to individuals, is increasingly available, and increasingly being ‘mined’ for hidden value.
3 - The human voice
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 54-84
-
- Chapter
-
Summary
In Chapter 2 we looked at the general handling, processing and visualisation of audio: vectors or sequences of samples captured at some particular sample rate, and which together represent sound.
In this chapter, we will build upon that foundation, and use it to begin to look at (or analyse) speech. There is nothing special about speech from an audio perspective – it is simply a continuous sequence of time-varying amplitudes and tones just like any other sound – and it is only when a human hears it and the brain becomes involved that the sound is interpreted as being speech.
There is a famous experiment which demonstrates something called sinewave speech: a recording of a sentence synthesised entirely from sinewaves. Initially, the brain of a listener does not consider this to be speech, and so the signal is unintelligible. However, after the corresponding sentence is heard spoken aloud in a normal way, the listener's brain suddenly ‘realises’ that the signal is in fact speech, and from then on it becomes intelligible. After that the listener does not seem to ‘unlearn’ this ability to understand sinewave speech: subsequent sentences which may be completely unintelligible to others will have become intelligible to this listener [8]. To listen to some sinewave speech, please go to the book website at http://mcloughlin.eu/sws.
There is a point to sinewave speech. It demonstrates that, while speech is just a structured set of modulated frequencies, the combination of these in a certain way has a special meaning to the brain. Music and some naturally occurring sounds also have some inherently speech-like characteristics, but we do not often mistake music for speech. It is likely that there is some kind of decision process in the human hearing system that sends speech-like sounds to one part of the brain for processing (the part that handles speech), and sends other sounds to different parts of the brain. However, there is a lot hidden inside the human brain that we do not understand, and how it handles speech is just one of those grey areas.
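For flavour, a toy Python/NumPy sketch of the sinewave-speech construction: a few pure sinusoids following slowly varying frequency tracks, summed together. Genuine sinewave speech derives its tracks from the formants of a real utterance; the tracks below are invented placeholders, so the output is speech-like in texture but not intelligible.

```python
import numpy as np

fs = 16000
t = np.arange(int(fs * 1.0)) / fs            # one second of samples

# Three slowly varying 'formant' frequency tracks (Hz) - invented values
f1 = 500 + 200 * np.sin(2 * np.pi * 1.5 * t)
f2 = 1500 + 400 * np.sin(2 * np.pi * 0.8 * t)
f3 = 2500 + 300 * np.sin(2 * np.pi * 1.1 * t)

def tone(track):
    # Integrate instantaneous frequency to obtain phase, then take the sine
    phase = 2 * np.pi * np.cumsum(track) / fs
    return np.sin(phase)

sws = (tone(f1) + 0.7 * tone(f2) + 0.4 * tone(f3)) / 3
print(f"synthesised {len(sws) / fs:.1f} s of sinewave 'speech'")
```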
Fortunately speech itself is much easier to analyse and understand computationally: the speech signal is easy to capture with a microphone and record on computer. Over the years, speech characteristics have been very well researched, with many specialised analysis, handling and processing methods having been developed for this particular type of audio.
1 - Introduction
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 1-8
-
- Chapter
-
Summary
Audio processing systems have been a part of many people's lives since the invention of the phonograph in the 1870s. The resulting string of innovations sparked by that disruptive technology has culminated in today's portable audio devices such as Apple's iPod, and the ubiquitous MP3 (or similarly compressed) audio files that populate them. These may be listened to on portable devices, on computers, as soundtracks accompanying Blu-ray films and DVDs, and in innumerable other places.
Coincidentally, the 1870s saw a related invention – that of the telephone – which has also grown to play a major role in daily life between then and now, and has likewise sparked a string of innovations down the years. The Scottish-born and educated Alexander Graham Bell was there at the birth of both inventions to contribute to their success. He would probably be proud to know, were he still alive today, that two entire industry sectors, telecommunications and infotainment, were spawned by the phonograph and the telephone.
However, after 130 years, something even more unexpected has occurred: the descendants of the phonograph and the descendants of the telephone have converged into a single product called a ‘smartphone’. Dr Bell probably would not recognise the third convergence that made all of this possible, that of the digital computer – which is precisely what today's smartphone really is. At heart it is simply a very small, portable and capable computer with microphone, loudspeaker, display and wireless connectivity.
Computers and audio
The flexibility of computers means that once sound has been sampled into a digital form, it can be used, processed and reproduced in an infinite variety of ways without further degradation. It is not only computers (big or small) that rely on digital audio: so do CD players, MP3 players (including iPods), digital audio broadcast (DAB) radios, most wireless portable speakers, television and film cameras, and even modern mixing desks for ‘live’ events (and coincidentally all of these devices contain tiny embedded computers too). Digital music and sound effects are all around us, and impact our leisure activities (e.g. games, television, videos), our education (e.g. recorded lectures, broadcasts, podcasts) and our work in innumerable ways to influence, motivate and educate us.
References
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 370-378
-
- Chapter
11 - Conclusion
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 366-369
-
- Chapter
-
Summary
You approach the doorway ahead of you. There is no visible lock or keyhole, no keypad, not even a fingerprint or retina scanner, but as you come closer you hear a voice saying ‘Welcome back, sir’. You respond with ‘Hello, computer’, which prompts the door to unlock and swing open to admit you. You enter through the doorway and the door closes behind you as lights turn on in the hallway you have just entered.
As you remove and hang up your coat and hat, the same voice asks you ‘Can I prepare a cup of tea or coffee for you, sir?’, to which you reply ‘Yes, Earl Grey in the living room please’. A minute or two later, sitting in a comfortable chair with your feet up and sipping a perfect cup of tea, the voice continues, ‘While you were out, you received several messages, do you wish to review them now?’
Reluctantly you reply, ‘Just summarise the most important messages for me, computer’.
‘There was a message from your brother asking if you would like to play squash tomorrow night, as well as a reminder from your secretary about the meeting early tomorrow morning. In addition, you still haven't replied to Great Aunt Sophie regarding …’
‘Skip that message,’ you interject.
The voice continues, ‘Okay, then the final message is a reminder that your car is due for servicing next week. Would you like me to schedule an appointment for that?’
‘Yes computer, make it Monday morning, around 10am. Oh, and confirm the squash for tomorrow evening at 5.30pm. If his computer hasn't booked a court yet, make sure you book one right away.’
Later you are watching a re-run of Star Trek: The Next Generation on 3D television and have a sudden idea that might help you in your job as a speech technology researcher. Before you forget it, you speak out, ‘Computer, take a brief note’. The television programme pauses before a voice announces ‘ready’, and you then speak a few sentences terminating in ‘end note’. At a command of ‘resume’, you are again confronted with Captain Jean-Luc Picard and his crew.
5 - Psychoacoustics
- Ian Vince McLoughlin
-
- Book:
- Speech and Audio Processing
- Published online:
- 05 June 2016
- Print publication:
- 21 July 2016, pp 109-139
-
- Chapter
-
Summary
If there is one topic that has most deeply impacted audio and speech research over the past two decades or so, it is psychoacoustics (by contrast, the deepest impact in audio and speech engineering has probably been big data – explored separately in Chapter 8). We now know that perceived sounds and speech owe just as much to psychology as they do to physiology. The state and activity of the human brain and nervous system have a profound influence on the characteristics of speech and sounds that are perceived by human listeners.
It is certainly beyond the scope of this book to delve into much detail concerning the psychological reasons underpinning psychoacoustics – and indeed much of that detail remains to be discovered – but we will discuss, demonstrate and uncover many interesting and useful psychoacoustic phenomena in this chapter. Extensive experiments by cross-disciplinary researchers over the past two decades have allowed computational models to be developed that begin to describe the effects of psychoacoustics. While these models vary in complexity and accuracy, and continue to increase in quality and usefulness, they have already found applications in many areas of daily life. The following sections will overview many of the effects that the models can (or could) describe. We will take a fascinating look at auditory scene analysis (which includes a number of auditory-illusion style demonstrations), before building and applying our own psychoacoustic models.
Psychoacoustic processing
The use of psychoacoustic criteria to improve communications systems, or rather to target the available resources towards the more subjectively important areas, is now common. Many telephone communications systems use A-law compression. Around 1990, Philips and Sony respectively produced the DCC (digital compact cassette) and MiniDisc formats, which both make extensive use of equal-loudness contours and masking information to compress high-quality audio [54]. Whilst neither of these was a runaway market success, they introduced psychoacoustics to the music industry, and paved the way for solid-state music players such as the Creative Technologies Zen micro, the Apple iPod and various devices from other innovative companies.
Most of these devices use the popular MP3 compression format, although more recent formats such as Ogg Vorbis, MP4 and various proprietary alternatives such as WMA also exist (refer to Infobox 2.2 on page 15 for descriptions of these).